{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Translate text between languages\n",
        "\n",
        "This notebook covers machine translation backed by Hugging Face models. Machine translation via cloud services has come a long way and produces high-quality results. This notebook shows how models from Hugging Face give developers a reasonable alternative for local machine translation."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies. Since this notebook uses optional pipelines, we need to install the pipeline extras package."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PNPJ95cdTKSS"
      },
      "source": [
        "# Create a Translation instance\n",
        "\n",
        "The Translation instance is the main entrypoint for translating text between languages. The pipeline abstracts translating text into a one-line call!\n",
        "\n",
        "The pipeline detects the input language, loads the relevant model to translate from the source language to the target language and returns the results. It also has built-in logic to split large text blocks into smaller sections the models can handle.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nTDwXOUeTH2-"
      },
      "source": [
        "%%capture\n",
        "\n",
        "from txtai.pipeline import Translation\n",
        "\n",
        "# Create translation model\n",
        "translate = Translation()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-vGR_piwZZO6"
      },
      "source": [
        "# Translate text\n",
        "\n",
        "The example below shows how to translate text from English to Spanish. This text is then translated back to English."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "-K2YJJzsVtfq",
        "outputId": "44df5404-ea14-4746-fc8b-a2e205bd9466"
      },
      "source": [
        "translation = translate(\"This is a test translation into Spanish\", \"es\")\n",
        "translation"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Esta es una traducción de prueba al español'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "K_UnAZQpetM8",
        "outputId": "46c9c68c-ddcf-4f55-bac3-89f25931e91b"
      },
      "source": [
        "translate(translation, \"en\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'This is a test translation into Spanish'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4cSI8GdtjhEM"
      },
      "source": [
        "# Translating multiple languages in a single call\n",
        "\n",
        "The section below translates a single English sentence into 5 different languages. The results are then passed to a single translation call to translate back into English. The pipeline detects each input language and loads the relevant translation models."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8jLxGtwNf0Aj",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "47040b2f-f6e5-482a-df81-7b758f47a7d5"
      },
      "source": [
        "def run():\n",
        "  languages = [\"fr\", \"es\", \"de\", \"hi\", \"ja\"]\n",
        "  translations = [translate(\"The sky is blue, the stars are far\", language) for language in languages]\n",
        "  english = translate(translations, \"en\")\n",
        "\n",
        "  for x, text in enumerate(translations):\n",
        "    print(\"Original Language: %s\" % languages[x])\n",
        "    print(\"Translation: %s\" % text)\n",
        "    print(\"Back to English: %s\" % english[x])\n",
        "    print()\n",
        "\n",
        "# Run multiple translations\n",
        "run()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Original Language: fr\n",
            "Translation: Le ciel est bleu, les étoiles sont loin\n",
            "Back to English: The sky is blue, the stars are far away\n",
            "\n",
            "Original Language: es\n",
            "Translation: El cielo es azul, las estrellas están lejos.\n",
            "Back to English: The sky is blue, the stars are far away.\n",
            "\n",
            "Original Language: de\n",
            "Translation: Der Himmel ist blau, die Sterne sind weit\n",
            "Back to English: The sky is blue, the stars are wide\n",
            "\n",
            "Original Language: hi\n",
            "Translation: आकाश नीला है, तारे दूर हैं\n",
            "Back to English: Sky is blue, stars are away\n",
            "\n",
            "Original Language: ja\n",
            "Translation: 天は青い、星は遠い。\n",
            "Back to English: The heavens are blue and the stars are far away.\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Xn3SlVE1LYvm"
      },
      "source": [
        "The overall translation quality is very high!"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Additional model types\n",
        "\n",
        "The translation pipeline is flexible and supports multiple model types. By default, the pipeline scans the Hugging Face Hub for models that best match the source-target translation pair. This often produces the best quality and typically uses a smaller model than a large multi-language model.\n",
        "\n",
        "The `findmodels=False` parameter overrides this behavior and always uses the specified model."
      ],
      "metadata": {
        "id": "3FdS5slz60eA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "translate = Translation(\"t5-small\", findmodels=False)\n",
        "translate(\"translate English to French: The sky is blue, the stars are far\", None)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "REb0X2Kz60Ew",
        "outputId": "ae13d4d0-318f-4740-f725-0029ceeabeac"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Le ciel est bleu, les étoiles sont loin'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Translation isn't limited to spoken languages. txtai provides a text-to-SQL model that converts English text into a txtai-compatible SQL statement."
      ],
      "metadata": {
        "id": "5vSJbW7zVJeH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "translate = Translation(\"NeuML/t5-small-txtsql\", findmodels=False)\n",
        "translate(\"translate English to SQL: feel good story since yesterday\", None)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "d5usam_jhKaz",
        "outputId": "c12368a0-ea26-4191-be5a-f2de70711003"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\"select id, text, score from txtai where similar('feel good story') and entry >= date('now', '-1 day')\""
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The last thing we'll do is run the multiple language example using only a single large multilingual model."
      ],
      "metadata": {
        "id": "cVb7uk7TXr_v"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "translate = Translation(\"facebook/mbart-large-50-many-to-many-mmt\", findmodels=False)\n",
        "run()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JkOLnvKZWI95",
        "outputId": "fef6402c-e20e-43e1-a3eb-e4580913fa7e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Original Language: fr\n",
            "Translation: Le ciel est bleu, les étoiles sont loin\n",
            "Back to English: The sky is blue, the stars are far away\n",
            "\n",
            "Original Language: es\n",
            "Translation: El cielo es azul, las estrellas están lejos.\n",
            "Back to English: The sky is blue, the stars are far away.\n",
            "\n",
            "Original Language: de\n",
            "Translation: Der Himmel ist blau, die Sterne sind weit\n",
            "Back to English: The sky is blue, the stars are far.\n",
            "\n",
            "Original Language: hi\n",
            "Translation: आकाश नीली है, तारे दूर हैं।\n",
            "Back to English: The sky is blue, and the stars are far away.\n",
            "\n",
            "Original Language: ja\n",
            "Translation: 空は青い、星は遠い\n",
            "Back to English: the sky is blue, the stars are far away.\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "Machine translation has made giant strides over the last couple of years. These models give developers a solid, locally-hosted alternative to cloud translation services. Additionally, there are models built for low-resource languages that cloud translation services don't support.\n",
        "\n",
        "A number of different models and configurations are supported. Give it a try!"
      ],
      "metadata": {
        "id": "TvWx9PS-X32c"
      }
    }
  ]
}